“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This course (CSB1021) was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the fourth in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will focus on standard visualizations of expression data. While our prior topics have focused on making visualizations that are applicable broadly to many types of data, a number of this week’s visualizations are used mainly on multi-dimensional data from experiments like RNAseq.
At the end of this lecture you will have covered the following topics:
goseq (grey background) - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or
folder
bold - heading or a term that is being defined
blue text - a named or unnamed hyperlink
... - Within each coding cell this will indicate an area
of code that students will need to complete for the code cell to run
correctly.
Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this
git-pull link. Simply clicking the link will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to run independently of the Toronto
tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
Today’s datasets will focus on differential expression (DE) analysis: the
basic visualizations of this data after it has already been processed by
packages like DESeq2.
This is an example dataset used for building a Sankey diagram. It builds a theoretical workflow for RNAseq analysis and visualization within a manuscript. What is a Sankey diagram? We’ll find out!
RNA-Seq read count data generated from SARS-CoV2 infections of AEC cells. Used to compare the timecourse of expression (pro-inflammatory) changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151
This is an RNAseq dataset comparison for infection of multiple cell types with pathogens like SARS-CoV-2, SARS-CoV-1, MERS-CoV, RSV (respiratory syncytial virus), IAV (influenza A virus) and HPIV3 (human parainfluenza virus type 3) published in Cell doi: 10.1016/j.cell.2020.04.026
This is an RNAseq dataset comparison for infection of human alveolar adenocarcinoma (A549) cells with and without influenza A virus (IAV) published on bioRxiv doi: https://doi.org/10.1101/2020.03.24.004655
A saved file with a final GO annotation dataset that we will examine using dot plot visualizations.
tidyverse which has a number of packages including
dplyr, tidyr, stringr,
forcats and ggplot2
viridis helps to create color-blind palettes for our
data visualizations
networkD3, gghighlight,
ggrepel, and UpSetR are used in building some
of our visualizations including flowcharts, and upset plots.
goseq, and org.Hs.eg.db are packages that
will help us perform some basic Gene Ontology (GO) annotations with our
data.
# # Install these packages onto r.datatools
# install.packages("devtools")
# install.packages("networkD3")
# install.packages("gghighlight")
# install.packages("ComplexUpset", type="source")
# install.packages("GGally")
# install.packages("ggrepel")
#
#
# # 220331: Current version of JupyterHub is not allowing goseq to be installed correctly
#
# # # Some packages can be installed via Bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(version = "3.18")
#
# # This should install goseq 1.54.0. Check this under the packages tab!
# BiocManager::install("goseq")
# BiocManager::install("org.Hs.eg.db")
#
# # This last part will install some older version packages so that ComplexUpset will run properly
# devtools::install_version("dplyr", version = "1.1.3", repos = "http://cran.us.r-project.org")
# devtools::install_version("ggplot2", version = "3.4.4", repos = "http://cran.us.r-project.org")
We’re playing fast and loose with some installations here to get it
to work on r.datatools.utoronto.ca so after the installation is
complete, restart your session with Session > Restart R.
Don’t refresh your browser or you’ll lose the installation.
# Packages to help tidy our data
library(tidyverse)
library(readxl)
# Packages for the graphical analysis section
library(viridis)
# Lecture 04 visualization packages
library(networkD3)
library(GGally)
library(gghighlight)
library(ggrepel)
library(ComplexUpset)
Load these libraries separately to save on a little bit of memory usage. We’ll need every last drop we can get.
# GO term analysis packages if you managed to install goseq
library(goseq)
library(org.Hs.eg.db)
Until this point, the data sets we have used in lecture have been relatively simple epidemiological information. All of our observations are values like new cases or vaccinated individuals tied to dates or time periods. Whereas most of our observations were connected in some linear fashion, the data we examine today will have tens of thousands of observations, each representing a different gene!
Given the complexity of our data, we will discuss ways to sort, partition, visualize and interpret it.
Many of you may be familiar with the idea of RNAseq data. It begins in the real world (or at the bench) with individuals/groups/conditions compared against a control state. Biological samples can be collected in time series or at a single endpoint. Each experimental condition should have its matching control or baseline profile for comparison, not to mention biological replicates!
After collecting your samples you still need to prepare and sequence your libraries, check your data quality, trim reads, map them to a transcriptome, estimate/normalize the expression of genes within each library and then compare the expression between libraries/samples to determine if there are transcriptional differences! All of these steps are beyond this class BUT if you plan on working with transcriptomes, you’ll eventually encounter this process. Today we’ll jump the line and work directly with differential expression (DE) data. That means we’ll be at the bottom right corner of the following diagram AKA, the end of the pipeline.
Image from Bioinformatics Workbook: RNA Sequencing Analysis
Sankey diagrams are named after Irish Captain Matthew Henry Phineas Riall Sankey, who first presented this type of visualization in 1898 to convey the energy efficiency of a steam engine!
With larger sets of data across multiple experiments, it can sometimes be helpful to use a Sankey diagram to explain the connection between different aspects of your samples. The key attribute of the Sankey diagram is that the width of a connection (flow) is proportional to the quantity represented. So the Sankey diagram is a flow diagram that also visually quantifies the dynamics of the relationships in your diagram. Think of it much like a stacked bar chart that can split or join with other bars at each connection.
A useful package to generate Sankey diagrams is
networkD3 using the sankeyNetwork() function.
To generate a Sankey diagram, you need information on two variables with
a third optional one:
source: the starting point of your flow
target: the end point of your flow
value: the number that will represent the width of your flow. Usually some quantity of contribution, but this can also be an optional variable.
We’ll take the time to reconstruct this Sankey plot in the following sections.
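To make the idea concrete before we build the full version, here is a minimal sketch with made-up nodes and values (not from our dataset):

```r
library(networkD3)

# Hypothetical links: Source/Target are 0-indexed row positions in `nodes`
links <- data.frame(source = c(0, 0, 1),   # flows start at A, A, B
                    target = c(2, 3, 3),   # ...and end at X, Y, Y
                    value  = c(10, 5, 8))  # flow widths

nodes <- data.frame(name = c("A", "B", "X", "Y"))

# Each link is drawn with a width proportional to its value
sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name")
```

Note that node A splits its flow between X and Y, while Y joins flows from both A and B - the stacked bar chart analogy in action.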
We’ll find a pre-made table that describes a theoretical RNAseq
workflow based on the datasets we’ll be working with today. Some
sections have been compressed for brevity but we’ll find all of that
information in ./data/Lecture04_sankey_data.csv. Let’s
begin by reading it in.
# Read in our Sankey diagram in ./data/Lecture04_sankey_data.csv
sankey_info.df <- read_csv("data/Lecture04_sankey_data.csv")
Rows: 24 Columns: 3── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): source, target
dbl (1): value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure
str(sankey_info.df)
spc_tbl_ [24 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ source: chr [1:24] "CoV" "CoV" "IVS" "RSV" ...
$ target: chr [1:24] "A549" "NHBE" "A549" "A549" ...
$ value : num [1:24] 0.25 0.25 0.25 0.25 0.7 0.2 0.4 0.4 0.15 0.15 ...
- attr(*, "spec")=
.. cols(
.. source = col_character(),
.. target = col_character(),
.. value = col_double()
.. )
- attr(*, "problems")=<externalptr>
# Look at a few rows
head(sankey_info.df)
From the above output we can see that one source may
have multiple target values and vice versa.
Although our data frame sankey_info.df contains the
information about our connections, the networkD3 package
requires some mapping of the nodes within our data. Each unique source
and target will be counted as a node and we’ll save that information
into sankey_nodes.df.
# Combine our source and target lists, then take the unique set and put that into the "name" vector.
# sankey_nodes.df <- data.frame(name=c(as.character(sankey_info.df$source), # Location of our source nodes
# as.character(sankey_info.df$target)) %>% # Location of our target nodes
# unique() # Generate the unique combination of the two sets
# )
# Combine our target and source nodes into a single vector
sankey_nodes.df <- c(sankey_info.df$source,
as.character(sankey_info.df$target)) %>%
# Only keep the unique values
unique() %>%
# Put this into a data frame under a column called "name"
data.frame(name = .)
# Look at the structure, it's just a list of the nodes
str(sankey_nodes.df)
'data.frame': 18 obs. of 1 variable:
$ name: chr "CoV" "IVS" "RSV" "A549" ...
match()
Now that we have our node information (18 unique nodes in total) stored, each with an implicit identification number (its row position), we have to map it back to
sankey_info.df. This is essentially like working with a
factor that spans two columns instead of just one. To accomplish the
mapping, we’ll use the match() method. The
match() function takes two values:
x: the values to be matched
table: the values to be matched against
and returns a vector of length x where each integer
value represents the position of its matched value from
table. Unmatched values from x (ie not found
in table) are set to nomatch.
match() vs “%in%”: The match() function and the matching operator %in% are very similar in idea. Whereas the former produces a set of matched positions between sets, the latter produces a logical vector indicating inclusion (TRUE) or exclusion (FALSE) from the intersection of the sets.
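A quick illustration of that difference, using small made-up vectors:

```r
x   <- c("b", "d", "a")
tbl <- c("a", "b", "c")

match(x, tbl)  # positions of x within tbl: 2 NA 1 ("d" is unmatched)
x %in% tbl     # logical inclusion:        TRUE FALSE TRUE
```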
Let’s try a quick example with our data.
# Which nodes match from our source nodes to our collected nodes (18 total nodes)?
match(sankey_info.df$source, sankey_nodes.df$name)
[1] 1 1 2 3 4 5 6 6 7 7 7 7 7 8 9 10 11 11 9 12 12 13 13 14
Note that our node information must be 0-indexed for networkD3! R uses an indexing scheme that begins at 1, so we’ll need to subtract one from the indices we generate.
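As a small illustration using a few of our node names (just a snippet, not the full node set):

```r
nodes <- c("CoV", "IVS", "RSV")

match("RSV", nodes)      # 3 under R's 1-based indexing
match("RSV", nodes) - 1  # 2, the 0-based index that networkD3 expects
```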
# Add our sankey_nodes.df as index information that maps between it and `sankey_info.df`
sankey_info.df <-
sankey_info.df %>% mutate(IDsource = match(.$source, sankey_nodes.df$name) -1,
IDtarget = match(.$target, sankey_nodes.df$name) -1) # nodes must be 0-indexed
# take a peek at what we've wrought
head(sankey_info.df)
sankeyNetwork()
Now that we have our two necessary data frames, we can build our
Sankey diagram with the sankeyNetwork() function. This
function has a few parameters that we’ll want to set:
Links: the location of our source/target
relationship information (sankey_info.df).
Nodes: the listing of nodes in our diagram
(sankey_nodes.df).
Source: the name of the source column
(IDsource).
Target: the name of our target column
(IDtarget).
Value: the name of our “optional” value to build the
width of flows (value).
NodeID: the names of our nodes
(sankey_nodes.df$name).
There are an additional number of parameters that allow you to play with the visualization of this diagram including:
colourScale: categorical colour of your node. Works
together with…
NodeGroup: for colouring your nodes based on an
extra column in Nodes.
LinkGroup: for colouring your links based on an
extra column in Links.
nodeWidth: the numeric width of each node.
For more information on these parameters and others, you can go here
# Build the Sankey diagram with all of our information
sankeyNetwork(Links = sankey_info.df, # Where is the "edge" information
Nodes = sankey_nodes.df, # Where is the "node" information
Source = "IDsource", # Where are the source node labels
Target = "IDtarget", # Where are the target node labels
Value = "value", # What value do we assign to the relationships/edges
NodeID = "name", # Where are the labels for the nodes stored?
sinksRight = TRUE, # Push the nodes to the right
fontSize = 10)
Links is a tbl_df. Converting to a plain data frame.
Stepping back one step in our flowchart, just prior to your DE analysis, you’ll have a set of read counts generated from RNA-Seq data. The counts are an estimate of the abundance of each transcript in your data. Some programs may include the analysis of spliced isoforms and others may not.
Your RNA-Seq datasets should usually contain multiple replicates. A quick way to assess the quality of your datasets is to generate a scatterplot matrix: an all-by-all assessment of your experiments that generates a scatterplot of read counts for each pair-wise dataset.
A scatterplot matrix can help to quickly assess the quality and consistency of your data.
Figure taken from: Visualization methods for differential expression analysis. Rutter et al., 2019. BMC Bioinformatics 20: 458
We’ll begin by importing our data from
Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. This
tab-separated data file contains total RNA experiments from the
infection of airway epithelial cells (AECs) under different conditions,
treatments with inhibitors, and timepoints.
# Read in your read_count data
wyler_readcounts.df <- read_tsv("./data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")
Rows: 27011 Columns: 38── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene
dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO-3, AECII_04_24h_200nM17AAG-1, AECII_05_24h_200nM17AAG-2, AECII_06_2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(wyler_readcounts.df, give.attr = FALSE)
spc_tbl_ [27,011 × 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ gene : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ...
$ width : num [1:27011] 1766 2130 9529 4653 2187 ...
$ AECII_01_24h_DMSO-1 : num [1:27011] 22 55 0 0 0 ...
$ AECII_02_24h_DMSO-2 : num [1:27011] 19 20 0 1 0 ...
$ AECII_03_24h_DMSO-3 : num [1:27011] 15 38 0 0 2 ...
$ AECII_04_24h_200nM17AAG-1 : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ...
$ AECII_05_24h_200nM17AAG-2 : num [1:27011] 27 15 0 0 0 ...
$ AECII_06_24h_200nM17AAG-3 : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ...
$ AECII_07_24h_S2_DMSO-1 : num [1:27011] 24 41 0 0 0 ...
$ AECII_08_24h_S2_DMSO-2 : num [1:27011] 18 46 0 0 2 ...
$ AECII_09_24h_S2_DMSO-3 : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ...
$ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ...
$ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ...
$ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ...
$ AECII_13_48h_DMSO-1 : num [1:27011] 24 57 0 0 0 ...
$ AECII_14_48h_DMSO-2 : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ...
$ AECII_15_48h_DMSO-3 : num [1:27011] 33 64 0 0 0 ...
$ AECII_16_48h_200nM17AAG-1 : num [1:27011] 29 35 0 0 1 ...
$ AECII_17_48h_200nM17AAG-2 : num [1:27011] 20 24 0 0 0 ...
$ AECII_18_48h_200nM17AAG-3 : num [1:27011] 23 28 0 0 0 ...
$ AECII_19_48h_S2_DMSO-1 : num [1:27011] 13 45 0 0 0 ...
$ AECII_20_48h_S2_DMSO-2 : num [1:27011] 22 36 0 0 0 ...
$ AECII_21_48h_S2_DMSO-3 : num [1:27011] 21 30 0 0 1 ...
$ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ...
$ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ...
$ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ...
$ AECII_25_72h_DMSO-1 : num [1:27011] 35 43 0 1 2 ...
$ AECII_26_72h_DMSO-2 : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ...
$ AECII_27_72h_DMSO-3 : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ...
$ AECII_28_72h_200nM17AAG-1 : num [1:27011] 32 57 0 0 2 ...
$ AECII_29_72h_200nM17AAG-2 : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ...
$ AECII_30_72h_200nM17AAG-3 : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ...
$ AECII_31_72h_S2_DMSO-1 : num [1:27011] 24 58 0 0 1 ...
$ AECII_32_72h_S2_DMSO-2 : num [1:27011] 33 63 0 0 3 ...
$ AECII_33_72h_S2_DMSO-3 : num [1:27011] 23 47 1 1 2 ...
$ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ...
$ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ...
$ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...
Before we proceed to generating our scatterplot matrix, we need to
select our data from our main dataset. In particular, the data from
Wyler et al., that we imported as
wyler_readcounts.df contains data from infecting human
airway epithelial cells (AECs) and treating with either
DMSO or 200nM of the HSP90 inhibitor,
17-AAG. With each condition there is a set of infected
(S2) or uninfected AECs. For each of these 4
conditions, there are 3 replicates and 3 timepoints
(24-, 48-, and
72-hours). In total that amounts to 36 sets of
data.
For our purposes, we’ll:
Compare the 24h vs 72h infected (ie S2 = SARS-CoV-2), DMSO-treated samples found in columns 9-11 and 33-35. We can either select by index or use a selection helper like matches() to incorporate a regular expression.
Filter our reads to the range of 10-5000 using the if_all() function.
Rename our columns to remove some redundant text (AECII_xx_).
Note that the if_all() function is a rather new addition
to the tidyverse and is used to help filter on a predicate across
multiple columns. This is very much like the
across() helper function you may have used previously for
summarizing data. Like the across() function, the
if_all() helper function uses the following parameters:
.cols: the columns you wish to apply your filtering
predicate upon.
.fns: the function or predicate you are filtering
with.
You’ll notice our use of ~ again to anonymize the
function.
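Before applying it to our read counts, here is a toy illustration (with hypothetical values) of how if_all() keeps only the rows where the predicate holds in every selected column:

```r
library(dplyr)

toy <- tibble(a = c(5, 20, 100),
              b = c(15, 30, 6000))

# Keep rows where every column is above 10 and below 5000
toy %>% filter(if_all(.cols = everything(), .fns = ~ .x > 10 & .x < 5000))
# Only the row (a = 20, b = 30) survives: row 1 fails on a, row 3 fails on b
```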
readcounts_24hv72h <-
  wyler_readcounts.df %>%
  # Select just the columns with SARS-CoV-2 infection and DMSO at 24 and 72h
  dplyr::select(matches(r"((24h|72h)_S2_DMSO)")) %>%
  # Restrict our data analysis to just read counts above 10 and below 5000
  filter(if_all(.cols = everything(), .fns = ~ .x > 10 & .x < 5000)) %>%
  # Rename the columns by removing the first portion: AECII_xx_
  rename_with(~ str_replace(string = .x,
                            pattern = r"(\w*_\d*_)",
                            replacement = ""))
Warning message:
In findGeneric(f, envir) :
'select' is a formal generic function; S3 methods will not likely be found
# Take a peek at our resulting tibble
str(readcounts_24hv72h)
tibble [12,430 × 6] (S3: tbl_df/tbl/data.frame)
$ 24h_S2_DMSO-1: num [1:12430] 24 41 2534 712 204 ...
$ 24h_S2_DMSO-2: num [1:12430] 18 46 2484 654 205 ...
$ 24h_S2_DMSO-3: num [1:12430] 22 42 2750 739 225 ...
$ 72h_S2_DMSO-1: num [1:12430] 24 58 958 1065 147 ...
$ 72h_S2_DMSO-2: num [1:12430] 33 63 1163 1081 156 ...
$ 72h_S2_DMSO-3: num [1:12430] 23 47 835 1022 124 ...
ggpairs() to generate a scatterplot matrix
You may recall working with the parallel coordinate plot from
lecture 2. Rather than transform the data ourselves, we
passed our data along to the ggparcoord() function from the
GGally package. Well that package is here to simplify our
lives again! Rather than generating the paired data between each
experiment ourselves, we can pass along the read count matrix we just
generated to the ggpairs() function. Parameters of the
ggpairs() function that we’ll use include:
data: the dataset containing our data. It can include both numerical and categorical data.
mapping: the aesthetic mapping (aside from x and y) used for altering your format.
columns: which columns from data are used to generate the plot. Defaults to all columns.
title, xlab, ylab: the titles of your various graph components.
upper and lower: named lists that may contain the variables “continuous”, “combo”, “discrete”, and “na”, used to determine how pairwise combinations of continuous and/or categorical data are plotted across the grid. You basically need to choose how these data combinations will be plotted.
diag: a named list that may only contain the variables “continuous”, “discrete”, and “na”, each of which is set to a specific kind of plot type (ie “densityDiag”, “barDiag”, or “blankDiag”).
Moreover, you may recall that GGally works with ggplot (or is ggplot-extensible), so we
can use many of the same features from ggplot to fix some of the
aesthetics of the result object. In the end, you’ll notice that this is
basically a faceted scatterplot but the dataframe required to generate
it is more complex than what we currently have. For more information on
this function, you can check out the GGally manual
or check out the list of
great examples provided on the authors’ github.
# Create your matrix scatterplot with ggpairs
ggpairs(readcounts_24hv72h,
# Set the alpha of our points so we can see them better
mapping = aes(alpha = 0.1),
# We'll play with the correlation text so the values are little larger than default for us
upper = list(continuous = wrap("cor", size = 9))
) +
  # Change some of the theme aspects like text size
  theme(text = element_text(size = 20),
        # Rotate the x-axis tick text to be vertical
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Work smarter, not harder: When it comes to certain complex visualizations, it is best to avoid re-inventing the wheel. There are many packages available that accomplish a great number of different visualizations. The key is knowing what kind of visualization you want to produce, and then checking if a package exists. While the ggplot package can accomplish quite a few basic and complex layered visualizations, ones like the above scatterplot matrix require the generation and replacement of multiple plot types. This requires a deeper level of knowledge beyond this course. A nice package like GGally takes away those issues while still offering the ability to customize and alter your plot format to a certain extent.
As you can see, the scatterplot matrix can provide a quick way to check that your replicates are similar, while your comparison conditions should show more variation between them. Looking at our above plot, the triangular sections of scatterplots in the top-left and lower-right represent the comparison of replicates in our experiment. These should produce scatter points that are close to the diagonal - meaning the replicates are quite similar in the data produced.
In the lower-left corner of the plot we find the comparison between our two conditions (24- vs 72-hour data) and you can see much more scatter away from the diagonal. If our conditions really do induce different transcriptional profiles, this is what we should expect to see.
Across the diagonal we see the density plot of our read counts. These aren’t particularly helpful with this dataset as the data predominantly appears to have low read counts even after we filter for a minimum of 10 reads per gene.
This technique is also a useful way to identify potentially mislabeled samples before processing your data for analysis. Changes from the above pattern can signal the mislabeling of samples or changes to the process of sample preparation.
Not just for expression analysis! While we have used the scatterplot matrices in this section to compare our RNAseq datasets, these plot formats have a much broader use in determining whether any linear correlation may exist between continuous variables. It’s a quick way to look at a wide scope of variables for possible relationships without doing all the in-depth analysis first. Think of it as yet another helpful tool for exploratory data analysis!
Digging down into your DE results with the human genome, you are looking to identify trends in gene expression differences across 43,000 potential transcript sequences - only half of which produce proteins. You can imagine that opening up a tabular file of that size could take a bit of time.
For each DE experiment there are a number of values generated. In a
popular package like DESeq2 you will typically find:
| Variable | Description |
|---|---|
| Gene name | The names of your genes which may be in different formats like Entrez ID, Ensembl, or gene symbol |
| Base mean | Average of the normalized counts values, accounting for size factors, taken over all samples (and or replicates) |
| Log\(_{2}\) fold-change | The effect size estimate - how much your gene’s expression has changed from control/base conditions |
| Log fold-change SE | An estimate of effect size uncertainty |
| p-value | Hypothesis test against H\(_{0}\) that no difference exists between sample groups |
| adjusted p-value | An adjusted p-value after taking into account multiple testing between groups/samples |
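To sketch how these columns are typically used (with a hypothetical toy table, not real results), a common first pass is to filter on effect size and adjusted p-value:

```r
library(dplyr)

# Hypothetical DE results mimicking the columns described above
de <- tibble(gene           = c("IFNB1", "ACTB", "IL6"),
             baseMean       = c(250, 10000, 480),
             log2FoldChange = c(5.2, 0.1, 3.8),
             padj           = c(1e-10, 0.9, 1e-4))

# Keep genes with |log2 fold-change| > 1 and adjusted p-value < 0.05
de %>% filter(abs(log2FoldChange) > 1, padj < 0.05)
# IFNB1 and IL6 pass; ACTB does not
```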
You can find more information on working with the DESeq2
package here
Once we have a set of data we can examine it from many aspects:
Let’s explore these different aspects and the kind of plots we can generate to accommodate the size of our data set. First, however, we need to find a useful data set to work with.
readxl library to open your .xlsx files
We’ve briefly touched on using this package in
Lecture 01 to help read our Microsoft Excel-based data.
Today’s data set, from Blanco-Melo et al. (Cell 2020), comes to us in the
form of an Excel file. Let’s use some of the tools from this package to
help open up our data file. We’ll start by peeking at how many sheets
there are using excel_sheets().
# What are the sheets to open in our data set?
excel_sheets("data/Blanco-Melo2020Cell.Supp1.xlsx")
[1] "Legend" "DESeq2_Results"
read_excel()
Now that we can see there are two sheets in our data, we can assign
each one to a different data frame using the read_excel()
command which contains the parameter sheet. We can use this
to determine which sheet to load, based either on its position (integer)
or its name as specified by excel_sheets().
# Assign our legend
blanco_legend.df <- read_excel("./data/Blanco-Melo2020Cell.Supp1.xlsx", sheet=1)
# Assign our data
blanco_data.df <- read_excel("./data/Blanco-Melo2020Cell.Supp1.xlsx", sheet=2)
str(blanco_legend.df)
tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
$ Field : chr [1:20] "SARS-CoV-2(A549)_L2FC" "SARS-CoV-1_L2FC" "MERS-CoV_L2FC" "RSV_L2FC" ...
$ Explanation: chr [1:20] "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" "Fold expresion change Infected vs Control (logarithmic scale to base 2)" ...
$ Notes : chr [1:20] "SARS-CoV-2 infection (A549 cells, MOI: 2, 24hpi)" "SARS-CoV-1 infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192)" "MERS-CoV infection (MRC5 cells, MOI: 3, 24hpi Data from: GSE56192)" "RSV (A549 cells, MOI: 2, 24hpi)" ...
str(blanco_data.df)
tibble [23,710 × 21] (S3: tbl_df/tbl/data.frame)
$ GeneName : chr [1:23710] "IFNL1" "IFNL2" "IFNL3" "IFNB1" ...
$ SARS-CoV-1_L2FC : num [1:23710] 0 0 0 0 0 ...
$ IAV_L2FC : num [1:23710] 1.37 1.379 1.815 0.468 0.24 ...
$ MERS-CoV_L2FC : num [1:23710] 0 0.125 0 0 0.125 ...
$ HPIV3_L2FC : num [1:23710] 7.84 6.84 6.23 6.44 2.62 ...
$ RSV_L2FC : num [1:23710] 7.14 4.84 4.83 4.86 0 ...
$ SARS-CoV-2(Calu-3)_L2FC : num [1:23710] 7.24 7.88 7.47 8.7 3.33 ...
$ SARS-CoV-2(A549)_L2FC : num [1:23710] 0 0 0 -0.127 0 ...
$ SARS-CoV-2(A549-ACE2)LowMOI_L2FC : num [1:23710] 0 0 0 0.388 0 ...
$ SARS-CoV-2(A549-ACE2)HiMOI_L2FC : num [1:23710] 5.08 5.25 5.05 5.76 1.54 ...
$ SARS-CoV-2(A549-ACE2)-Ruxolitinib_L2FC: num [1:23710] 3.548 1.947 2.511 3.92 0.297 ...
$ padj_SARS-CoV-1 : num [1:23710] 1 1 1 1 1 ...
$ padj_IAV : num [1:23710] 1.00 1.00 1.00 1.00 1.00 1.11e-06 1.00 1.00 1.00 1.00 ...
$ padj_MERS-CoV : num [1:23710] 1 1 1 1 1 ...
$ padj_HPIV3 : num [1:23710] 2.62e-68 1.48e-48 3.66e-35 2.13e-42 1.00 ...
$ padj_RSV : num [1:23710] 4.51e-44 9.74e-17 3.19e-16 4.92e-17 1.00 ...
$ padj_SARS-CoV-2(Calu-3) : num [1:23710] 2.53e-139 5.69e-107 5.51e-102 0.00 4.20e-11 ...
$ padj_SARS-CoV-2(A549) : num [1:23710] 1 1 1 1 1 ...
$ padj_SARS-CoV-2(A549-ACE2)LowMOI : num [1:23710] 0 0 0 0 0 ...
$ padj_SARS-CoV-2(A549-ACE2)HiMOI : num [1:23710] 1.00 2.69e-20 2.37e-18 1.15e-28 1.00 ...
$ padj_SARS-CoV-2(A549-ACE2)-Ruxolitinib: num [1:23710] 1.00 1.00 1.00 1.54e-12 1.00 ...
Yes, it looks like the legend has some helpful information about the dataset itself found in the second sheet. The DE data looks like it’s split across 20 columns encompassing 10 experimental data sets. For each data set, it looks like there is a Log\(_{2}\) fold-change value (L2FC) and an adjusted p-value (padj).
Let’s take a closer look at blanco_legend.df first. From
it we can parse out some important information about the experiments
themselves like the pathogen and host
in each experimental set.
# Look at the entire legend
blanco_legend.df
str_match_all()

You may remember that we can easily match patterns in our strings using functions from the stringr package. With a function like str_match_all(), you can also include capture groups in your pattern. For each string match, this function returns a matrix that contains the complete match along with any grouped patterns denoted by () parentheses.
If a string contains multiple matches to your pattern, it will generate additional rows in the matrix for that string.
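As a minimal sketch (the string and pattern below mirror the Notes entries we just saw), you can watch the capture groups come back as extra matrix columns:

```r
library(stringr)

# A string in the style of the Notes column
note <- "SARS-CoV-2 infection (A549 cells, MOI: 2, 24hpi)"

# Two capture groups: the pathogen name and the host cell line
m <- str_match_all(note, pattern = r"(([\w-]+)[\s\w]+\(([\w-]+)\s*cells)")

# str_match_all() returns a list with one matrix per input string;
# column 1 holds the full match, columns 2+ hold the captured groups
m[[1]][1, 2]  # "SARS-CoV-2"
m[[1]][1, 3]  # "A549"
```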
Looking at our Notes column of blanco_legend.df, it looks like it follows a regular pattern where we can extract the pathogen and host cell information:
| Section | Examples |
|---|---|
| Pathogen name | SARS-CoV-2, RSV |
| Intermediate strings | infection, infection with Ruxolitinib |
| Host cell type | A549, A549-ACE2 |
# pattern match for pathogen and host information and save it into a data frame you can access later.
exp_info <-
blanco_legend.df %>%
# Use dplyr::select because the select function has been overridden by another library
dplyr::select(Notes) %>%
# Use str_match_all to grab the pattern we want, piece by piece
# example string: SARS-CoV-2 infection (A549-ACE2 cells, MOI: 2, 24hpi)
str_match_all(pattern = r"(([\w-]+)[\s\w]+\(([\w-]+)\s*cells)") %>%
# Set into a data frame
as.data.frame() %>%
# Change the column names (would be X1, X2, X3 otherwise)
`colnames<-`(c("notes", "pathogen", "host"))
Warning: argument is not an atomic vector; coercing
# Alternative code to set names:
# magrittr::set_colnames(c("notes", "pathogen", "host"))
# Take a look at the resulting data
exp_info
# Add this new information to our original legend information
blanco_exp_info.df <-
blanco_legend.df %>%
# Make some new columns with pathogen and host information
mutate(pathogen = exp_info$pathogen, # Our pathogen information
host = exp_info$host) %>% # Our host information
# Combine our pathogen and host information into a single column as well
unite(col = exp_info, pathogen:host, sep="_", remove = FALSE) %>%
# Just keep the original Field column and the new ones
dplyr::select(1, 4, 5, 6)
# Take a look at what we are working with
blanco_exp_info.df
Now that we have catalogued our experimental information, let’s build that into a long-format version of our actual DE dataset found in blanco_data.df. The steps we’ll take to prepare our data are:

1. Convert from wide to long format with pivot_longer().
2. Join with our experimental data information in blanco_exp_info.df.
3. Rename our experiment information (formerly the column names) to a consistent experiment_value format so we can split it out properly.
4. Separate the experiment_value information; currently each experiment maps to both an L2FC and a padj value.
5. Convert our data back to a slightly wider format with pivot_wider().
blanco_data_long.df <-
blanco_data.df %>%
# collapse all of the data columns
pivot_longer(cols=(2:21), names_to = "experiment", values_to = "value") %>%
# add our experimental info to each observation. Note that each gene has multiple obs now.
full_join(., blanco_exp_info.df, by=c("experiment" = "Field")) %>%
# We need to fix all of the "padj_experiment" values to "experiment_padj" format
# Note we could have renamed our columns BEFORE pivoting
mutate(experiment = str_replace_all(.$experiment,
pattern=r"((padj)_([\w-\(\)]*))", # Capture the groups
# Put them back together in a switched order
replacement = r"(\2_\1)")) %>%
# Now we want to split the experiment column into two: experiment and data_type (L2FC or padj)
separate(experiment, c("experiment", "data_type"), sep="_", remove=TRUE) %>%
# Pivot out the data so that each observation for each experiment has an L2FC and padj value
pivot_wider(names_from = data_type, values_from = value)
blanco_data_long.df
# Don't forget to save your wrangled data!
save(blanco_data_long.df, file="./data/Lecture04_blanco.RData")
After all of our data wrangling, you can see we have a data frame with nearly a quarter of a million observations spanning 10 experimental conditions. From a broad level, we can look across entire data sets to identify genes of interest. There are a couple of ways to do this, and we’ll begin with a standard set of plots that convey the overall distribution of our data. While this view of the data is low-resolution, we are still able to compare specific gene information centred around mean expression levels, magnitude of change, and statistical significance.
Originally used in the analysis of microarray data, these plots are also applied to visualizing high-throughput sequencing results. The MA plot is built on a data transformation between two sets of data such that:

M (minus on the log scale) represents the log\(_{2}\) ratio of fold differences between your data sets

A (average on the log scale) represents the average expression signal
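If you only had normalized mean expression values for two conditions, the transformation itself is a one-liner each way. Here is a sketch with made-up counts; the pseudocount is our own addition to avoid taking the log of zero:

```r
# Hypothetical normalized mean counts for a control and a treated sample
ctrl    <- c(20, 150, 1000, 5)
treated <- c(160, 150, 250, 40)

# Small pseudocount so zero counts don't produce -Inf
pc <- 0.5

# M ("minus" on the log scale): log2 fold difference between the sets
M <- log2((treated + pc) / (ctrl + pc))

# A ("average" on the log scale): mean log2 expression signal
A <- 0.5 * (log2(treated + pc) + log2(ctrl + pc))

# A gene with equal counts in both conditions lands at M = 0
M[2]  # 0
```

Plotting M on the y-axis against A on the x-axis then gives the MA plot.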
We plot our M values along the y-axis against the
corresponding A values on the x-axis. Do these
characteristics sound familiar? They are standard output for
DESeq2 data sets! Unfortunately our above data doesn’t have
that information anymore, but we have a dataset that does! It’s another
dataset from an earlier pre-publication copy of the Blanco-Melo et al.,
2020 paper: Blanco-Melo2020_Supp_Data_4.xlsx.
Let’s open this up and take a look!
# We'll be looking at DE data for IAV infection of A549 cells
DE_A549_IAV.df <- read_excel("./data/Blanco-Melo2020_Supp_Data_4.xlsx", sheet=2)
# Look at the data frame we've made
head(DE_A549_IAV.df)
across()

Okay, so we can’t use the data right away. The read_excel() command has decided to convert some of our columns to character format because it detected NaN values. So we’ll have to quickly coerce the data before moving on. We can convert the data with a mutate() command using the helper verb across().
To successfully use this helper, we will want to list out the columns
we want to use in the .cols parameter, and then apply a
function like as.numeric. We don’t actually need
all of these columns, but it’s good practice!
# Convert our numeric columns to the proper type.
DE_A549_IAV.df <-
DE_A549_IAV.df %>%
# Use mutate() with across() to convert columns to numeric
mutate(across(.cols=c(3:7), as.numeric))
# Check the result
str(DE_A549_IAV.df)
tibble [27,914 × 8] (S3: tbl_df/tbl/data.frame)
$ GeneName : chr [1:27914] "A1BG" "A1BG-AS1" "A1CF" "A2M" ...
$ baseMean : num [1:27914] 188.281 71.851 36.154 0.637 15.91 ...
$ log2FoldChange: num [1:27914] 0.625 -0.271 -1.411 3.668 -1.107 ...
$ lfcSE : num [1:27914] 0.351 0.515 0.635 4.761 0.924 ...
$ stat : num [1:27914] 1.783 -0.526 -2.222 0.77 -1.198 ...
$ pvalue : num [1:27914] 0.0746 0.5991 0.0263 0.441 0.2308 ...
$ padj : num [1:27914] 0.361 0.863 0.204 NaN 0.61 ...
$ status : chr [1:27914] "OK" "OK" "OK" "Low" ...
Skip the wrangling with a proper import! While we used the above import as an opportunity to practice our wrangling skills, we could have avoided it altogether if we imported properly. Within the read_excel() function there is a parameter called “na” which allows us to define the character(s) that represent null or NA values. The default is simply blank or empty cells, but knowing what we do now about the data frame, we could also have set the parameter na = “NaN”, which would allow our data frame to properly import the values and correctly assign variable types. So, it’s always helpful to know what your functions can do!
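A sketch of that cleaner import of the same file and sheet as above, declaring both empty cells and the literal string “NaN” as missing:

```r
library(readxl)

# Declaring the NA strings up front lets read_excel() assign numeric
# column types correctly, so no mutate(across(...)) cleanup is needed
DE_A549_IAV.df <- read_excel("./data/Blanco-Melo2020_Supp_Data_4.xlsx",
                             sheet = 2,
                             na = c("", "NaN"))
```

(This assumes the course data file is present at the path used earlier.)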
ggplot2

Given that we already have our data in a differential expression format, we can just use ggplot2 to map the correct variables to the x and y axes. In this case, log2FoldChange serves as our M and baseMean as our A. There are a number of ways we could colour this plot, but we’ll use our padj values to highlight those genes whose DE values are statistically significant.
Now, if we were using raw or normalized expression data, we would need to calculate the proper values for the M and A transformation ourselves. There are tools that can take care of this for you, such as the ggmaplot() function from the ggpubr package, which you can find here
For our plot, to differentiate between our points, let’s use the
gghighlight package from last week! What will our
predicates be based on? Log2 fold change? Adjusted
p-value?
# MA plot! Need to select based on a fold change > x and padj <= fdr
# Usually FC = 1.5, fdr = 0.05 but we'll use the "status" variable here instead.
DE_A549_IAV.df %>%
filter(status == "OK") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=baseMean, y = log2FoldChange) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
# Use a log10 scale on the x axis
scale_x_log10() +
# Set the breaks on our y-axis
scale_y_continuous(limits = c(-10, 10), breaks = seq(-10, 10, 2)) +
# 4. Geoms
# Colour our points all in red
geom_point(colour = "red", alpha = 0.2, size=4, na.rm=TRUE) +
### 2.1.2
# Change it so that only high L2FC with low p-values will be in red, the rest as black
gghighlight(padj < 0.05, abs(log2FoldChange) > 1.5, unhighlighted_params = list(colour = "black"))
So our plot shows us the general distribution of the DE data. Notably, it appears that many more genes are upregulated than downregulated in comparison to our control. We also see, as expected, more non-significant variation at lower normalized mean counts. Conversely, as mean counts increase, there is less variation overall, but we do see a gene that passes our DE thresholds with a low enough p-value! To investigate further, you could filter the data or look again with a volcano plot!
While the MA plot explored data based on its fold-change values in relation to mean expression, the volcano plot looks purely at differential expression based on its p-value (i.e., the probability that the DE is real). To emphasize the importance of the p-value, we will -log10 transform it so smaller p-values sit higher on the y-axis. This quickly partitions the visualization, drawing high-DE/low p-value combinations away from the bulk of the population.
To plot this data, we’ll return to our larger
blanco_data_long.df which contains data from multiple
experiments. For now, let’s focus on just the SARS-CoV-2 data that has
been generated in the A549 host cells. Our -log transformation
of the p-values will also be problematic when encountering 0, so be
careful! Let’s compare plots with and without these problem values.
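The problem value is easy to demonstrate: -log10 sends p = 0 to infinity, which has no finite position on the y-axis:

```r
# -log10 transform of some example p-values, including an exact zero
p <- c(1, 0.05, 1e-10, 0)
neg_log_p <- -log10(p)
neg_log_p
# The final entry is Inf; such points get dropped or pinned to the top
# edge of the plot, which is why we compare with and without them
```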
To label our points of interest, we’ll also use the
geom_text_repel() from the ggrepel package.
This will help arrange and plot text onto our scatterplot in a way that
avoids collisions with the data and the other labels! You could also
experiment with using the directlabels package that we
played around with last week to accomplish a similar feat.
# Remind ourselves what our data table looks like
head(blanco_data_long.df)
# Remove p-values = 0
volcano.plot1 <-
# Volcano plot!
blanco_data_long.df %>%
filter(pathogen == "SARS-CoV-2", host=="A549", padj !=0) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=L2FC, y = -log10(padj)) +
# Themes
theme_bw()+
theme(text = element_text(size=20)) +
labs(title = "Volcano plot with p-values == 0 removed") +
# 4. Geoms
# points all coloured red
geom_point(colour = "red", alpha = 0.7, size=4) +
# highlight leaves only the ones meeting our criteria as red, colour rest as blue
gghighlight(abs(L2FC) > 2, padj < 0.05, unhighlighted_params = list(colour = "lightblue")) +
### 2.2.0
# Add labels but only for the top 10 genes
geom_text_repel(data = blanco_data_long.df %>%
# Filter the data used to label
filter(pathogen == "SARS-CoV-2", host=="A549", abs(L2FC) > 2, padj !=0) %>%
# Arrange in ascending order
arrange(padj) %>%
# Take the first 10 results
dplyr::slice(1:10),
aes(label = GeneName), size=6)
# leave in p-values = 0
volcano.plot2 <-
# Volcano plot!
blanco_data_long.df %>%
filter(pathogen == "SARS-CoV-2", host=="A549") %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
aes(x=L2FC, y = -log10(padj)) +
# Themes
theme_bw()+
theme(text = element_text(size=20)) +
labs(title = "Volcano plot with all p-values") +
# 4. Geoms
# points all coloured red
geom_point(colour = "red", alpha = 0.7, size=4) +
# highlight leaves only the ones meeting our criteria as red, colour rest as blue
gghighlight(abs(L2FC) > 2, padj < 0.05, unhighlighted_params = list(colour = "lightblue")) +
# Add labels but only for the top 10 genes
geom_text_repel(data = blanco_data_long.df %>%
# Filter the data used to label
filter(pathogen == "SARS-CoV-2", host=="A549", abs(L2FC) > 2) %>%
# Arrange in ascending order
arrange(padj) %>%
# Take the first 10 results
dplyr::slice(1:10),
aes(label = GeneName), size=6)
# Plot both of our figures together
# Chunk options fig.show="hold", out.width="50%" display the plots side by side
volcano.plot1
volcano.plot2
Looking at our volcano plots, we can clearly see the volcano shape we are looking for. The data is split between our \(\pm\) log\(_{2}\) fold-change thresholds, with points highlighted when we see DE beyond them. You can customize your parameters, but you can also see the emphasis on lower p-values in our data set. The most “significant” data is found in the upper left and right quadrants of the graph, which wasn’t made clear in the MA plot. You could use the same tricks to label your MA plot too, but there the significant points could end up anywhere along the plot.
In the end, if you do have access to the original counts, it may be of use to consider this information when further investigating your candidate genes from DE.
Section 2.0.0 Comprehension Question: Comparing our two versions of the volcano plot, we see different sets of genes highlighted as the “top 10” based on adjusted p-values. What makes these two sets of data different? Why do we see datapoints at the top edge of our plots in the second version?
Now that we’ve taken an overall look at our data, we can begin to assess our data based on some candidate genes or hypothesis testing. For instance, you may only be interested in looking at genes with an L2FC > 3. In other cases, you may be interested in a group of genes pertaining to a function like the chemokines which are part of the inflammation response.
After selecting a subset of genes we can begin to compare their DE
data more meaningfully across different sample sets. Let’s continue
working with our long-format dataset blanco_data_long.df
which contains our multiple RNAseq experiments from various infection
conditions. We will begin by generating a list of the highest DE genes
with low p-values in the SARS-CoV-2 infection scenario.
Note here we’ll use the pull() function which is a
nicer-looking way to retrieve a single column as a vector from
our tibble object. This is equivalent to using a
.$colName format.
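On a toy tibble (hypothetical values), the two forms return the same plain vector:

```r
library(dplyr)
library(tibble)

toy <- tibble(GeneName = c("IL6", "CCL2", "CCL5"),
              L2FC     = c(5.1, 3.2, 2.8))

# pull() extracts a single column as a vector...
a <- toy %>% pull(GeneName)

# ...equivalent to the .$colName form inside a magrittr pipe
b <- toy %>% .$GeneName

identical(a, b)  # TRUE
```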
# Select the genes we want to visualize
heatmap_gene.list <-
blanco_data_long.df %>%
# filter for data from a specific experiment, and then by high positive L2FC values with low p-values
filter(pathogen == "SARS-CoV-2", host == "A549", L2FC > 3, padj < 0.001) %>%
dplyr::pull(GeneName)
# Take a look at the resulting list
heatmap_gene.list
[1] "IL6" "IL1A" "CSF2" "PTX3" "ULBP1" "STC2" "ADM2" "MMP10" "RPS6KA2" "PCDH1" "GREM1" "TM6SF1" "LAMC2"
[14] "ALDH1L2" "INHBE" "ROCK1P1" "FLJ35024" "FUT1"
Now that we have generated a select group of genes to examine, our DE data is well-suited to visualizing as a heatmap (or heat plot). In our heatmap we plot values along two categorical axes: on the y-axis, DE values from our individual genes, and on the x-axis, values across multiple experiments. To visualize the data itself, we will use the geom_tile() command.
# heatmap of our data based on a list of top DE genes from the SARS-CoV-2 infection of A549 cells
blanco_data_long.df %>%
filter(GeneName %in% heatmap_gene.list,
host == "A549"
) %>%
# mutate(GeneName = factor(GeneName, levels = heatmap_gene.list)) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
### 3.1.0 set aesthetics
aes(x=pathogen, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=18)
)+
labs(title = "Heatmap of L2FC values in A549 cells infected by different pathogens") +
# 3. Scaling
# Reverse the y-axis so the genes appear in list order from top to bottom
scale_y_discrete(limits = rev) +
scale_fill_viridis_c(option="plasma") +
# Geoms
### 3.1.0 Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width=0.95)
# heatmap of our data based on a list of top DE genes from the SARS-CoV-2 infection of A549 cells
# Split x-axis by experiment name
blanco_data_long.df %>%
filter(GeneName %in% heatmap_gene.list,
pathogen == "SARS-CoV-2"
) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
aes(x=experiment, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=18),
axis.text.x = element_text(angle=45, hjust = 1)
) +
labs(title = "Heatmap of L2FC values in SARS-CoV2 infections by experiment") +
# 3. Scaling
# Reverse the y-axis so the genes appear in list order from top to bottom
scale_y_discrete(limits = rev) +
scale_fill_viridis_c(option="plasma") +
# Geoms
# Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width = 0.95)
From our list of the top 18 DE hits in our SARS-CoV-2 set, it looks like we see some similar hits across some of the genes when infecting A549 cells with other viruses like HPIV3 and RSV.
We see strong L2FC values for IL6, IL1A and LAMC2, for instance. IL6 (interleukin 6) has been reported to play a role in mounting an effective immune response to certain viral infections, but its interactions with other factors may also contribute to the exacerbation of viral phenotypes. Looking across different host cell infections by SARS-CoV-2, there is again consistent upregulation of IL1A and IL6.
Using a similar method, we can instead focus on genes from a specific family or pathway, as long as we have a list. Let’s try the family of chemokines, which play a role in inducing cell movement during the immune response.
# Build a list of the potential chemokines by looking for those matching the "CCL" pattern
heatmap_ccl.list <-
blanco_data_long.df %>%
# Filter specifically for any genes that have CCL in their name
filter(str_detect(GeneName, "CCL")) %>%
pull(GeneName) %>%
unique()
str(heatmap_ccl.list)
chr [1:29] "CCL5" "CCL4" "CCL22" "CCL20" "CCL3" "CCL2" "CCL1" "CCL11" "CCL13" "CCL14" "CCL15" "CCL15-CCL14" "CCL16" "CCL17" "CCL18" "CCL19" ...
Now we can use this list in our ggplot visualization.
Note that we could have generated a much longer pipe series
without saving the object heatmap_ccl.list but then we
wouldn’t have a chance to examine it before plotting that data to learn
that it has 29 unique values.
# heatmap of our data based on a list of chemokine genes
blanco_data_long.df %>%
# Filter for genes in our list, and experiments where the host cell was "A549"
filter(GeneName %in% heatmap_ccl.list,
host == "A549"
) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
aes(x=experiment, y = GeneName, fill=L2FC) +
# Theme
theme_bw()+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=45, hjust = 1)
)+
# 3. Scaling
scale_fill_viridis_c(option="plasma")+
# Geoms
# Use a tile and set the width a little smaller to give a gap between groups
geom_tile(width = 0.95)
From our heatmap, we can observe that, looking only at the chemokines, there is consistent upregulation of CCL5 and CCL2 when infecting our A549 cells with any of these viruses. In comparison, however, HPIV3 and RSV infection show much higher DE across the CCL family than SARS-CoV-2, which appears to elicit less of an inflammatory response in the time frame sampled.
But wait there’s more! While we are dealing with just a subset of our RNA-Seq data, much more is available to visualize on a greater scale. Next week we’ll dig deeper into high-dimensional data and how heatmaps can be used to visualize and organize this kind of information to help identify and convey patterns.
Note that in the above heatmap we no longer have any information on the adjusted p-values from these L2FC scores. Something to keep in mind when making these heatmaps. What would happen if we filtered on p-values as well? Is there a way we could track that information?
If we replace our geom_tile() with a
geom_point() we are able to add an additional dimension to
our data by utilizing size mapping.
We’ll repeat our visualization from above but use colour to represent
the L2FC values and a size to represent the
padj values in our dataset.
# heatmap of our data based on a list of chemokine genes
blanco_data_long.df %>%
# Filter for genes in our list, and experiments where the host cell was "A549"
filter(GeneName %in% heatmap_ccl.list,
host == "A549"
) %>%
# 1. Data
ggplot(.) +
#2. Aesthetics
aes(x=experiment, y = GeneName) +
# Theme
theme_bw()+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=45, hjust = 1)
)+
# 3. Scaling
scale_colour_viridis_c(option="viridis") +
scale_size_continuous(range = c(1, 10), breaks = c(0, 1.3, 10, 100)) +
# Geoms
### 3.2.0 Use colour and size in our dotplot to represent L2FC and padj
geom_point(aes(colour = L2FC, size = -log10(padj)))
Now we can more clearly see which values are worth noting or
investigating. We can see that CCL2 has a more consistent
L2FC and padj combination although this would
require further investigation since our size range is still quite
limited. We can further zoom in on our data by specifically visualizing
our genes of interest!
We can narrow our search further: instead of looking at whole gene groups, we can pick a small set of specific genes. We’ll use a dotplot where the y-axis places points by L2FC value and the x-axis is a categorical list of genes. We’ll keep all hosts in this plot and group our data in a few ways: colour by pathogen, shape and facets by host, and point size by adjusted p-value.
single_genes = c("CCL5", "CCL2", "CCL20", "CCL17", "IL6", "IL1A")
blanco_data_long.df %>%
filter(GeneName %in% single_genes) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 3.2.0 set the aesthetics for our dot plot colouring by pathogen, shaping by host
aes(x=GeneName, y = L2FC, colour=pathogen, shape=host) +
# Theme
theme_bw()+
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_viridis_d(direction = -1,
guide = guide_legend(override.aes = list(size = 5))) +
# Scale the size range
scale_size(range = c(8, 4)) +
scale_shape(guide = guide_legend(override.aes = list(size = 5))) +
# 4. Geoms
### 3.2.0 adjust the dot size based on the adjusted p-value
geom_point(aes(size = padj), alpha = 0.8, stroke = 2) +
# 6. Facets
facet_wrap(~ host, ncol = 2, scales = "free_y")
Focusing on small sets of genes, we can see some trends across our groups. For instance, 3 of our pathogens produce a strong IL6 response, as we’ve seen in our heatmaps. We also see a >2 L2FC in CCL2 as a response to infection by SARS-CoV-2. Notably, our weakest IL6 response to SARS-CoV-2 was seen in the ACE2-expressing A549 hosts, so our initial heatmap in section 3.1.0 didn’t give us a full picture of what might be happening.
It would be best to further investigate these on a more specific subset of hosts and/or pathogens.
Comprehension Question 3.0.0 While we see some inconsistency in our IL6/SARS-CoV-2 response in the above dotplot, are we possibly missing some context? If you were to return to the source of this data prior to our wrangling (see section 1.5.2), you might note that there are 3 kinds of A549-ACE2 experimental groups. What are some ways you could account for this information when wrangling?
Up until this point, we’ve been feeling our way through the data, rather… undirected. Often within manuscripts you will find deeper analyses of differential expression by examining the collection of functional terms to determine trends or pathways of interest. The Gene Ontology (GO) resource represents a codification of gene function through three domains: cellular component (CC), molecular function (MF), and biological process (BP).
Most, if not all, of the genes in our dataset can be described in some way by a GO term. Once we collect these terms, we can begin to look for meaningful groups that could be considered over-represented (or under-represented) in our dataset. Once we have collected this information we can visualize it similar to our individual gene dotplots.
goseq package to pull down GO terms for your genes

Using the goseq package we can perform a full GO term analysis on our DE data, testing for enriched terms within our set of over- or under-expressed genes. However, to begin the process of pulling down GO terms with goseq, we first need to generate a named integer vector that splits our genes into the two relevant categories.
We’ll store our categorized set of genes in
genes_SARS_CoV_2 where a 1 represents over/underexpressed
genes in our set, and 0 represents all remaining genes.
# Set up the gene information we need for the GO-term analysis
# It needs to know which genes meet our FC criteria and which do not
#nrow(blanco_data.df)
# Casting our logical as an integer will create a 0/1 matrix
genes_SARS_CoV_2 <- as.integer(blanco_data.df$`padj_SARS-CoV-2(A549)` < 0.05 & # low p-value
abs(blanco_data.df$`SARS-CoV-2(A549)_L2FC`) > 1.5) # absolute L2FC > 1.5
# We'll name each element in our vector by the gene names
names(genes_SARS_CoV_2) <- blanco_data.df$GeneName
# What is our object?
str(genes_SARS_CoV_2)
Named int [1:23710] 0 0 0 0 0 1 0 0 1 0 ...
- attr(*, "names")= chr [1:23710] "IFNL1" "IFNL2" "IFNL3" "IFNB1" ...
# Summarize our vector
table(genes_SARS_CoV_2)
genes_SARS_CoV_2
0 1
23232 478
So, with a quick bit of filtering, we can see from our table summary that we have close to 500 genes that are significantly differentially expressed with an absolute L2FC above 1.5. Now we have to figure out how to map these to GO terms!
The goseq package is compatible with a number of organisms and gene identifiers for mapping your data to the GO database. To see which organisms are supported, we can use supportedOrganisms(). Since the raw RNA-Seq reads were aligned to the human genome using the hg19 assembly (see Blanco-Melo et al., 2020), we’ll be working with hg19 as well. We’ll use the annotation information provided for the next part of our analysis.
First, however, let’s check that hg19 is indeed a supported genome annotation.
# Pull down the supported organisms and look for the human "hg19" genome
supportedOrganisms() %>%
filter(Genome == "hg19")
Looking at our table above, we can immediately see that there are 3 potential annotation sets to draw from. Each uses a different kind of ID: Entrez Gene ID, Ensembl gene ID, and Gene Symbol. Which one do we use? This depends on how our data is formatted, but let’s take a quick look at this screenshot of the CCL8 entry from NCBI:
It’s important to know the difference between your gene symbols and various identifiers!
Knowing what we’ve seen of the data already, the information we’re looking for is in the third row. Since our data uses gene symbols, we can map to the GO database with the geneSymbol identifier.
The second piece of information we need is a probability weighting function (PWF). The PWF assigns each gene a weighting that is essentially the probability of that gene being called differentially expressed based on its length alone. It’s important to generate this weighting, as it can influence our enrichment analysis; ultimately the PWF helps inform the creation of the null distribution in our analysis.
We’ll use the nullp() function to generate the PWF for
our dataset.
# Generate the PWF for our vector of genes
pwf_SARS_CoV_2 <- nullp(DEgenes = genes_SARS_CoV_2, # Our list of DE genes
genome = "hg19", # The GO annotation we want to use
id = "geneSymbol") # The id we'll use to generate the data using gene symbols
Loading hg19 length data...
Warning: initial point very close to some inequality constraints
Warning: collapsing to unique 'x' values
Warning: collapsing to unique 'x' values
Warning: collapsing to unique 'x' values
Warning: collapsing to unique 'x' values
head(pwf_SARS_CoV_2)
In the plot of our PWF, we see a subset of genes and their PWF values against the fitted curve. The goal is for the points to lie relatively close to the fit; if they are far off, there may be an issue with the annotation data or with how the gene lengths for the PWF were calculated.
We’re now ready to query the GO database for up- and down-regulated
terms in our data. We’ll use the goseq() command to do
this, providing our PWF object as well as the correct database
information. The default method used in this process to generate our
enrichment analysis is the Wallenius
approximation. You can choose to use random sampling but
this method can take much longer with little difference in results.
For more information on this process see the goseq
package vignette here
# Generate our go enrichment analysis
GO.wall_SARS_CoV_2 <- goseq(pwf = pwf_SARS_CoV_2, # Provide our pwf object
genome = "hg19", # Set the annotation set to use
id = "geneSymbol", # Which variable will we use to compare with
method = "Wallenius" # How will you decide on the null distribution
)
Fetching GO annotations...
For 5382 genes, we could not find any categories. These genes will be excluded.
To force their use, please run with use_genes_without_cat=TRUE (see documentation).
This was the default behavior for version 1.15.1 and earlier.
Calculating the p-values...
'select()' returned 1:1 mapping between keys and columns
# What does the final result look like?
head(GO.wall_SARS_CoV_2)
# What is the structure of our resulting data?
str(GO.wall_SARS_CoV_2)
'data.frame': 22500 obs. of 7 variables:
$ category : chr "GO:0006261" "GO:0071162" "GO:0000727" "GO:0006260" ...
$ over_represented_pvalue : num 1.59e-10 1.64e-09 3.89e-09 6.80e-09 8.36e-09 ...
$ under_represented_pvalue: num 1 1 1 1 1 ...
$ numDEInCat : int 21 7 7 25 7 10 8 86 132 74 ...
$ numInCat : int 152 11 12 262 13 36 20 2178 4044 1710 ...
$ term : chr "DNA-templated DNA replication" "CMG complex" "double-strand break repair via break-induced replication" "DNA replication" ...
$ ontology : chr "BP" "CC" "BP" "BP" ...
goseq results

Looking at the results of our analysis, we are returned a listing of 22,500 unique GO terms across the 3 domains (CC, BP, and MF). We are given two sets of p-values, one each for over-represented and under-represented terms. With each term, we also see how many of our DE genes fall into that term’s category (numDEInCat) versus the total number of genes annotated with that term in the GO database we queried (numInCat). As you can see from the first few rows of data, some GO term categories can encompass a large number of genes.
Remember how we created this data! Let’s also remind ourselves that we set up our DE gene list using the data from A549 cells infected with SARS-CoV-2. Looking at the manuscript itself, while these cells mount a distinct immune response, replication of the virus was very low in this cell type, which lacks expression of the ACE2 receptor. We’re really just using this data to make some visualizations!
To begin our analysis we can filter the enriched GO terms sets by
applying a multiple hypothesis test correction with
p.adjust() using the Benjamini-Hochberg
method to minimize the false discovery rate.
# Generate an enriched Go term dataset
enriched.GO_SARS_CoV_2 <-
# Adjust the p-values of the over-represented column
# GO.wall_SARS_CoV_2[p.adjust(GO.wall_SARS_CoV_2$over_represented_pvalue, method="BH") < 0.05, ]
GO.wall_SARS_CoV_2 %>%
filter(p.adjust(over_represented_pvalue, method = "BH") < 0.05)
# How many terms come back from our filtering?
str(enriched.GO_SARS_CoV_2)
'data.frame': 84 obs. of 7 variables:
$ category : chr "GO:0006261" "GO:0071162" "GO:0000727" "GO:0006260" ...
$ over_represented_pvalue : num 1.59e-10 1.64e-09 3.89e-09 6.80e-09 8.36e-09 ...
$ under_represented_pvalue: num 1 1 1 1 1 ...
$ numDEInCat : int 21 7 7 25 7 10 8 86 132 74 ...
$ numInCat : int 152 11 12 262 13 36 20 2178 4044 1710 ...
$ term : chr "DNA-templated DNA replication" "CMG complex" "double-strand break repair via break-induced replication" "DNA replication" ...
$ ontology : chr "BP" "CC" "BP" "BP" ...
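As a quick aside on what method = "BH" is doing, here is a self-contained toy example (the p-values below are invented) showing how the Benjamini-Hochberg adjustment shrinks the number of "significant" tests:

```r
# Invented p-values for eight hypothetical tests
p.raw <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)

# Benjamini-Hochberg adjustment, as used in our filter above
p.BH <- p.adjust(p.raw, method = "BH")

sum(p.raw < 0.05)  # 5 tests pass a naive 0.05 cutoff
sum(p.BH < 0.05)   # only 2 survive after controlling the FDR
```

With 22,500 GO terms tested at once, this correction is what keeps our enriched list down to a trustworthy 84 terms.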
# Load our goseq data to look at and visualize
# This will also contain some additional datasets we'll want to look at.
load("./data/Lecture04.RData")
# What kind of terms do we see in our enrichment?
enriched.GO_SARS_CoV_2$term
[1] "DNA-dependent DNA replication"
[2] "DNA replication origin binding"
[3] "double-strand break repair via break-induced replication"
[4] "cell population proliferation"
[5] "regulation of cell population proliferation"
[6] "DNA replication initiation"
[7] "DNA replication"
[8] "extracellular region"
[9] "MCM complex"
[10] "cellular response to organic substance"
[11] "response to stress"
[12] "pre-replicative complex assembly involved in nuclear cell cycle DNA replication"
[13] "pre-replicative complex assembly"
[14] "pre-replicative complex assembly involved in cell cycle DNA replication"
[15] "cellular response to chemical stimulus"
[16] "cell cycle"
[17] "cell cycle G1/S phase transition"
[18] "cell cycle DNA replication"
[19] "G1/S transition of mitotic cell cycle"
[20] "3'-5' DNA helicase activity"
[21] "positive regulation of cell population proliferation"
[22] "cytokine-mediated signaling pathway"
[23] "extracellular space"
[24] "cellular response to biotic stimulus"
[25] "response to organic substance"
[26] "cell cycle process"
[27] "blood vessel development"
[28] "vasculature development"
[29] "regulation of extrinsic apoptotic signaling pathway in absence of ligand"
[30] "nuclear DNA replication"
[31] "cholesterol biosynthetic process via desmosterol"
[32] "cholesterol biosynthetic process via lathosterol"
[33] "cellular response to lipopolysaccharide"
[34] "negative regulation of signal transduction in absence of ligand"
[35] "negative regulation of extrinsic apoptotic signaling pathway in absence of ligand"
[36] "mitotic cell cycle process"
[37] "response to cytokine"
[38] "cellular response to cytokine stimulus"
[39] "enzyme linked receptor protein signaling pathway"
[40] "DNA unwinding involved in DNA replication"
[41] "growth factor receptor binding"
[42] "cellular response to molecule of bacterial origin"
[43] "steroid metabolic process"
[44] "mitotic DNA replication"
[45] "epithelial cell differentiation"
[46] "extracellular matrix organization"
[47] "extracellular structure organization"
[48] "sterol metabolic process"
[49] "alcohol metabolic process"
[50] "cholesterol biosynthetic process"
[51] "secondary alcohol biosynthetic process"
[52] "positive regulation of cellular component movement"
[53] "response to chemical"
[54] "regulation of nucleotide-binding oligomerization domain containing signaling pathway"
[55] "cholesterol metabolic process"
[56] "signal transduction in absence of ligand"
[57] "extrinsic apoptotic signaling pathway in absence of ligand"
[58] "tissue development"
[59] "positive regulation of vasculature development"
[60] "positive regulation of cell motility"
[61] "sterol biosynthetic process"
[62] "single-stranded DNA helicase activity"
[63] "cellular modified amino acid catabolic process"
[64] "positive regulation of cell migration"
[65] "cell surface receptor signaling pathway"
[66] "secondary alcohol metabolic process"
[67] "positive regulation of locomotion"
[68] "DNA helicase activity"
[69] "organic hydroxy compound biosynthetic process"
[70] "protein kinase B signaling"
[71] "catalytic activity, acting on DNA"
[72] "alcohol biosynthetic process"
[73] "DNA metabolic process"
[74] "organic hydroxy compound metabolic process"
[75] "mitotic cell cycle"
[76] "receptor regulator activity"
[77] "regulation of transcription involved in G1/S transition of mitotic cell cycle"
[78] "regulation of cell cycle"
[79] "steroid biosynthetic process"
[80] "regulation of nucleotide-binding oligomerization domain containing 2 signaling pathway"
[81] "fibroblast growth factor-activated receptor activity"
[82] "nucleotide-binding oligomerization domain containing signaling pathway"
[83] "inflammatory response"
[84] "response to lipopolysaccharide"
Looking at our enriched set of GO terms, we see terms like
inflammatory response and response to stress.
Now that we have a set of enriched GO terms, we can filter and plot this
information as a dotplot.
Let’s look at terms that include the word “response”.
enriched.GO_SARS_CoV_2 %>%
# Generate a percentage value
mutate(percentOfCat = numDEInCat/numInCat,
# Create a dummy category
# pathogen = "SARS-CoV-2",
# Make a set of adjusted p-values
FDR = p.adjust(over_represented_pvalue, method = "BH")
) %>%
# Filter for only terms that include the word "response"
filter(str_detect(term, "response")) %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 4.2.1 set aesthetics by numDEInCat and FDR
aes(x="SARS-CoV-2", y=term, size=percentOfCat, colour = FDR) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_size(range=c(3,10)) +
scale_colour_viridis_c(direction = -1) +
# 4. Geoms
geom_point()
With a little bit of elbow grease, we can generate a data frame combining
GO enrichment analyses from multiple experiments, and
then compare enrichment across datasets! Our
Lecture04.RData contained the combined experimental GO
results in the object enriched.GO_combined.df.
# Take a look at our combined data frame
enriched.GO_combined.df
# Load up a combined enrichment analysis
# load("./data/Lecture04.RData")
enriched.GO_combined.df %>%
# 1. Data
ggplot(.) +
# 2. Aesthetics
### 4.2.4 Set the new x-axis value and colour aesthetic
aes(x=pathogen, y=term, size=numDEInCat, colour = over_represented_pvalue) +
# Theme
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_size(range=c(3,10)) +
scale_colour_viridis_c(direction = -1) +
# 4. Geoms
geom_point()
UpSet plots are an alternative to Venn diagrams that show both the intersections of sets and the sizes of the sets. Venn diagrams can be difficult to interpret with more than two or three sets. This is a real figure from BMC Bioinformatics. Sure it looks pretty, but what does the value 24 represent in terms of groups A, B, C, D, and E? Stare at it long enough and you’ll work out which region it belongs to, but now imagine the same diagram with ten groups.
The UpSet plot was first published in 2014 and has
become a helpful tool for visualizing the intersections between
components of datasets. The publication was accompanied by a package for
working with these plots, known as UpSetR,
and more projects implementing this kind of visualization have appeared
since. While UpSetR has not been updated in
nearly three years, a more actively maintained package, ComplexUpset,
is available.
To some extent, ComplexUpset is extensible with
ggplot, allowing you to use familiar syntax to
modify these plots. Add to this the ability to stack or include
additional visualizations of your datasets’ distributions or other
characteristics, and it is a pretty attractive package to work
with.
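Before plotting, it can help to see what an UpSet plot actually counts. The base-R sketch below (toy membership data, invented for illustration) tabulates both the per-set sizes and the intersection sizes from a logical membership matrix:

```r
# Toy logical membership matrix: one row per item, one column per set
membership <- data.frame(
  SetA = c(TRUE,  TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  SetB = c(TRUE,  TRUE,  FALSE, TRUE,  TRUE,  TRUE),
  SetC = c(FALSE, FALSE, FALSE, TRUE,  TRUE,  TRUE)
)

# Set sizes: the left-hand barplot of an UpSet plot
colSums(membership)
# SetA SetB SetC
#    4    5    3

# Encode each row's combination of sets, then count rows per combination:
# these counts are the heights of the upper barplot
combos <- apply(membership, 1, paste, collapse = "-")
sort(table(combos), decreasing = TRUE)
```

ComplexUpset’s upset() performs this tabulation internally and then draws the matrix and bars for you.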
ComplexUpset package to visualize overlapping datasets
Let’s see how UpSet plots work in practice. We can take our data
from blanco_data_long.df and compare the overlap of DE
genes (with or without filtering). To do this, we
again need to convert our values to a Boolean representation after some
more data wrangling. The wrangling steps are outlined in the cells below.
Let’s look at our data structure again.
head(blanco_data_long.df)
Begin by making our short-list of genes to filter by.
# filter for upregulated genes in our SARS-CoV-2/A549 data set and make a list of gene names
SARS_CoV_2.list <-
blanco_data_long.df %>%
# Filter our data
filter(pathogen == "SARS-CoV-2",
host == "A549",
L2FC >= 2, padj < 0.05) %>%
# Select just our gene names using a shorthand conversion
pull(GeneName)
# Less code than the following steps:
# dplyr::select(GeneName) %>% unlist() %>% as.character()
SARS_CoV_2.list
[1] "CCL5" "ICAM1" "GBP4" "FYN" "IL6" "IL1A" "IL1B" "TNFRSF9" "EREG" "IL3RA" "MMP1" "CSF2"
[13] "PTX3" "CFB" "TREM1" "ULBP1" "CLEC4E" "OLR1" "IL8" "ATF3" "RUNX2" "TIPARP" "TRIM36" "DNAH17"
[25] "ABCA1" "STC2" "ADM2" "MMP10" "SLFN5" "HMGN2P46" "SERPINB7" "BMPER" "BCAT1" "MTHFD2" "ATP6V1C2" "TMEM156"
[37] "C15orf48" "RPS6KA2" "HSD11B1" "SERPINB4" "ZMIZ1-AS1" "SLC6A15" "ECM2" "PCDH1" "C12orf39" "EXPH5" "SERPINB3" "GREM1"
[49] "TM6SF1" "SCG5" "RASGRP2" "BEX2" "THEG" "SDR16C5" "TSC22D3" "SCN4A" "GAS5" "ZFAS1" "C6orf48" "RAB39B"
[61] "LAMC2" "CDCP1" "CCBE1" "PARP8" "MAP2" "AIM1" "RASGRF2" "KCNQ3" "CDH3" "AMPD3" "PTPRE" "BCL2A1"
[73] "ALDH1L2" "EGOT" "CHAC1" "INHBE" "ROCK1P1" "FLJ35024" "FUT1" "N4BP3" "ANK2" "RBMS3" "PGM5P2" "ARRDC4"
[85] "ZMAT1" "NDUFA4L2" "DDIT3"
Filter our blanco_data_long.df and pivot to a
wide-format containing our logical values for inclusion.
# Generate our upset data
blanco_upset.df <-
blanco_data_long.df %>%
# Filter our data list using our genes of interest
filter(GeneName %in% SARS_CoV_2.list) %>%
# Convert L2FC and padj to a logical: upregulated if L2FC >= 2 and padj < 0.05
mutate(upregulated = (L2FC >=2) & (padj < 0.05)) %>%
# Select just the gene names, experiments, and the logical variable
dplyr::select(GeneName, experiment, upregulated) %>%
# Pivot our data wide so that each GeneName is now a row, each experiment is a column
pivot_wider(names_from = experiment, values_from = upregulated) %>%
# Convert to a dataframe for the ComplexUpset package (or it will throw an error)
as.data.frame()
str(blanco_upset.df)
'data.frame': 87 obs. of 11 variables:
$ GeneName : chr "CCL5" "ICAM1" "GBP4" "FYN" ...
$ SARS-CoV-1 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ IAV : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
$ MERS-CoV : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ HPIV3 : logi FALSE FALSE TRUE FALSE TRUE TRUE ...
$ RSV : logi FALSE TRUE TRUE FALSE TRUE TRUE ...
$ SARS-CoV-2(Calu-3) : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ SARS-CoV-2(A549) : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ SARS-CoV-2(A549-ACE2)LowMOI : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
$ SARS-CoV-2(A549-ACE2)HiMOI : logi FALSE FALSE TRUE FALSE TRUE FALSE ...
$ SARS-CoV-2(A549-ACE2)-Ruxolitinib: logi FALSE FALSE FALSE FALSE FALSE TRUE ...
As you can see from our data transformations, we now have our
experiments listed in columns, with each gene represented in a row.
Looking down each column, a value of TRUE or
FALSE indicates whether that gene was overexpressed in
that specific experimental context.
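The filter/mutate/pivot_wider pipeline above can be mimicked with a tiny self-contained base-R example (the gene and experiment names are a toy subset, with invented memberships): cross-tabulating long-format hits and thresholding the counts yields the same kind of gene-by-experiment logical matrix.

```r
# Toy long-format data: one row per (gene, experiment) pair in which
# the gene passed our upregulation filter (memberships invented)
hits <- data.frame(
  GeneName   = c("CCL5", "CCL5", "ICAM1", "IL6"),
  experiment = c("RSV",  "IAV",  "RSV",   "IAV")
)

# Cross-tabulate, then convert counts to logical membership
membership <- table(hits$GeneName, hits$experiment) > 0
membership
# CCL5 is TRUE in both columns; ICAM1 only in RSV; IL6 only in IAV
```

In the lecture pipeline we keep pivot_wider() because it preserves the GeneName column and fits our dplyr chain; either way, remember that upset() wants a plain data.frame, not a table or tibble.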
If we had not filtered by our gene short-list first, we would have a data frame with more than 23K rows! Now that we have our genes presented in this way, we can proceed to generate an UpSet plot.
Working with the upset() function to build our plot, we
will concern ourselves with the following parameters:
- data: a data frame including the logical columns that represent membership in each set.
- intersect: a list of column names to be used to generate the intersections.
- name: the label shown below the intersection matrix.
- height_ratio: ratio of the intersection matrix height to the intersection size bar height (default = 0.5).
- width_ratio: ratio of the overall set size width to the intersection matrix width (default = 0.3).
- min_size and max_size: the minimal and maximal number of observations required for an intersection to be included.
- min_degree and max_degree: the minimal and maximal degree (number of intersecting sets) for an intersection to be included.
- n_intersections: the maximum number of intersections to display that satisfy our min_*/max_* criteria.
- themes: a parameter used to pass theme changes through the upset_default_themes() or upset_modify_themes() functions.
# Generate our upset plot
upset(data = blanco_upset.df, # Provide our dataset
intersect = colnames(blanco_upset.df)[2:11], # Name the columns we want to analyse
name='Experimental condition', # Set the label below the intersection matrix
width_ratio=0.1, # Make the Set size width a little smaller
min_size = 2, # Require there to be a minimum of 2 members to show an intersection
n_intersections = 15, # Set the max number of intersections we want to plot
themes = upset_default_themes(text = element_text(size = 20)) # Set the plot text size to 20
)
From our UpSet plot, we can see there are three regions:
The left-hand barplot denotes the total number of observations in each set/category; in this case, experimental conditions.
The bottom matrix graphically represents the different combinations of categories.
The upper barplot displays the number of occurrences of the combination shown in the matrix below it.
From our set sizes, it looks like an L2FC cutoff of > 2 was perhaps too stringent for the IAV and MERS-CoV DE data. Since we initially filtered our genes by the best hits in our SARS-CoV-2(A549) dataset, we can see that it has the highest-frequency group: 34 genes overexpressed in this set alone. The next largest overlapping set is the 14 genes shared between the SARS-CoV-2(A549) set and the RSV(A549) DE data.
You can also see from the plot that there are only 9 combinations of
datasets with any real overlap. Had we set min_size = 1
instead, the remainder of the grouping combinations would be seen to
contain just a single gene in each.
Overall, this makes interpreting overlapping gene sets much simpler than a complicated Venn diagram!
Today we took a long look at some popular visualizations for expression data sets generated by an experiment like RNAseq. Our analyses focused more on highlighting aspects of the differential expression information itself from a big-picture low-resolution view like volcano plots to zooming down to the individual gene level with dot plots.
Next week we’ll look at examples of “big” data in an even broader sense by using dimensional reduction techniques, such as grouping our datasets with principal component analysis.
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | | |
|---|---|---|
| Code | 50% | Does it follow best practices? Does it make good use of available packages? Was data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset? Are groupings appropriate? Are titles/axes/legends correct? Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Additional marks may be awarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: April 11th, 2024.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Sankey diagram documentation: https://www.rdocumentation.org/packages/networkD3/versions/0.4/topics/sankeyNetwork
riverplot, another package for making your Sankey
diagrams: https://cran.r-project.org/web/packages/riverplot/riverplot.pdf
MA plots directly from two sets of expression data: https://rpkgs.datanovia.com/ggpubr/reference/ggmaplot.html
Generating goseq data for RNA analysis: https://bioconductor.org/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.